Add xarray Dataset/DataArray support to ACCESS_ESM_CMORiser #145
Conversation
- Add new `input_data` parameter to accept xarray Dataset or DataArray objects
- Maintain full backward compatibility with the existing `input_paths` parameter
- Automatically convert DataArrays to Datasets for processing
- Skip frequency validation for xarray inputs (data already loaded)
- Update all CMORiser subclasses (Atmosphere, Ocean OM2/OM3) to support the new interface
- Preserve all existing functionality (resampling, chunking, validation)
- Add comprehensive parameter validation and deprecation warnings

This enables in-memory processing workflows and integration with xarray-based analysis pipelines while maintaining compatibility with existing file-based workflows. All existing tests pass (34/34), confirming no breaking changes.
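A minimal sketch of the parameter handling the bullets describe. The helper name `resolve_input` and the exact error message are assumptions for illustration, not the PR's actual code; only the behaviour (mutually exclusive parameters, DataArray promoted to Dataset) comes from the description above.

```python
import xarray as xr


def resolve_input(input_paths=None, input_data=None):
    """Hypothetical sketch of the PR's input validation and conversion."""
    if (input_paths is None) == (input_data is None):
        # Exactly one of the two inputs must be supplied
        raise ValueError("Provide exactly one of input_paths or input_data")
    if input_data is not None:
        if isinstance(input_data, xr.DataArray):
            # DataArrays are promoted to single-variable Datasets
            return input_data.to_dataset(name=input_data.name or "data")
        return input_data
    # Fall back to the existing file-based path
    return xr.open_mfdataset(input_paths)
```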
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #145     +/-  ##
========================================
- Coverage   57.55%   52.35%   -5.21%
========================================
  Files          18       18
  Lines        2403     2722    +319
========================================
+ Hits         1383     1425     +42
- Misses       1020     1297    +277
```
rhaegar325
left a comment
Thanks @rbeucher for this PR, the overall structure is very clear and the implementation is well organised. The design aligns nicely with the existing CMORisation workflow, and after some basic testing I didn’t observe any obvious bugs.
There are a few edge cases around bounds handling that might be worth keeping in mind for future improvements:
For ocean variables, the current `time_bnds` handling is not yet fully covered. When `input_data` is provided as a `DataArray`, `time_bnds` can be missing, which may lead to runtime errors. In practice, this means the workflow relies on a helper function that automatically derives `time_bnds` from `time`.
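A helper of the kind described might look like the following; the function name and the midpoint-based interior edges are assumptions, since the PR does not show the helper itself:

```python
import numpy as np


def derive_time_bounds(time):
    """Hypothetical helper: infer contiguous time bounds from cell midpoints."""
    time = np.asarray(time, dtype=float)
    mid = (time[:-1] + time[1:]) / 2.0  # interior edges between samples
    step0 = time[1] - time[0]           # extrapolate first/last edges
    step_n = time[-1] - time[-2]
    edges = np.concatenate(
        ([time[0] - step0 / 2.0], mid, [time[-1] + step_n / 2.0])
    )
    # Pair consecutive edges into (ntime, 2) bounds
    return np.stack([edges[:-1], edges[1:]], axis=1)
```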
A similar pattern appears for atmosphere variables with spatial bounds. For example, if bounds variables are not present, errors such as

```
KeyError: "No variable named 'lon_bnds'. Variables on the dataset include ['lon', 'time', 'lat', 'pr']"
```

can occur.
Additionally, if the dataset is loaded without `decode_cf=False`, time coordinates may be decoded as cftime objects, which can trigger a `TypeError` in `atmosphere.py` (around line 174) when they are implicitly cast to numeric types:

```
TypeError: float() argument must be a string or a real number, not 'cftime._cftime.DatetimeProlepticGregorian'
```
These don’t look like blockers for this PR, but it might be useful to document these assumptions or add some lightweight safeguards around bounds generation and time handling in a follow-up.
```
)
if resampling_required:

# Keep only required data variables
```
Dropped redundant coordinates and dimensions to prevent them from affecting other parts of the workflow. These steps were previously handled implicitly by `xr.open_mfdataset()` and now need to be handled explicitly.
```
used_dims.update(self.ds[var].dims)

# Exclude auxiliary time dimension
if "time_0" in used_dims:
```
`time_0` is a special coordinate and needs to be handled specifically.
```
self.ds = self.chunker.rechunk_dataset(self.ds)
print("✅ Dataset rechunking completed")

def _ensure_numeric_time_coordinates(self, ds: xr.Dataset) -> xr.Dataset:
```
Method to convert cftime-formatted time values to numeric values.
```
return ds_resampled, True

def calculate_time_bounds(
```
Three methods for calculating `time_bnds`, `lat_bnds`, and `lon_bnds`.
```
)  # Make a copy to avoid modifying original

# SAFEGUARD: Convert cftime coordinates to numeric if present
self.ds = self._ensure_numeric_time_coordinates(self.ds)
```
Added a safeguard to handle cases where the input data uses cftime in time-related coordinates and variables.
Add automatic calculation of missing coordinate bounds and fix cftime handling for CMIP6 compliance

This PR implements automatic calculation of missing coordinate bounds and fixes cftime handling for CMIP6 compliance.

Changes Made

1. cftime to numeric conversion safeguard
2. Coordinate cleanup improvements
3. New utility functions in `utilities.py`
rhaegar325
left a comment
1. cftime to numeric conversion safeguard

- Added `_ensure_numeric_time_coordinates()` method to convert cftime objects to numeric values when datasets are loaded with `decode_cf=True`
- Prevents `TypeError: float() argument must be a string or a real number, not 'cftime._cftime.DatetimeProlepticGregorian'` in downstream operations
- Preserves time encoding attributes (`units`, `calendar`) after conversion
- Uses default `units='days since 0001-01-01'` when missing
2. Coordinate cleanup improvements

- Implemented proper removal of unused coordinates in `load_dataset()`
- Added handling for auxiliary dimensions (e.g., `time_0`) via `isel(time_0=0, drop=True)`
- Only keeps coordinates that are actually used as dimensions by data variables
- Prevents dimension mismatch errors in transpose operations
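The cleanup described above might be sketched as follows; the helper name `drop_unused_coords` is hypothetical, and only the behaviour (collapsing `time_0`, keeping coordinates whose dimensions are used by data variables) comes from the bullets:

```python
import xarray as xr


def drop_unused_coords(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical sketch of the coordinate cleanup."""
    # Collect every dimension actually used by a data variable
    used_dims = set()
    for var in ds.data_vars:
        used_dims.update(ds[var].dims)
    # Collapse the auxiliary time dimension if nothing uses it
    if "time_0" in ds.dims and "time_0" not in used_dims:
        ds = ds.isel(time_0=0, drop=True)
    # Drop coordinates tied to dimensions no data variable uses
    unused = [c for c in ds.coords if not set(ds[c].dims) <= used_dims]
    return ds.drop_vars(unused)
```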
3. New utility functions in `utilities.py`

- `calculate_latitude_bounds()`: Calculates latitude bounds for both regular and irregular grids, with proper handling of polar boundaries (clipping to [-90°, 90°])
- `calculate_longitude_bounds()`: Calculates longitude bounds supporting both 0-360° and -180-180° conventions, with automatic detection of global vs. regional grids and proper handling of periodic boundaries
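A minimal sketch of what `calculate_latitude_bounds()` might do for a 1-D axis, assuming midpoint bounds with polar clipping as described; the PR's actual implementation (which also handles irregular 2-D grids) is not shown here:

```python
import numpy as np


def calculate_latitude_bounds(lat):
    """Sketch: midpoint bounds for a 1-D latitude axis, clipped to [-90, 90]."""
    lat = np.asarray(lat, dtype=float)
    mid = (lat[:-1] + lat[1:]) / 2.0
    # Extrapolate the outermost edges symmetrically, then clip to the poles
    first = lat[0] - (mid[0] - lat[0])
    last = lat[-1] + (lat[-1] - mid[-1])
    edges = np.clip(np.concatenate(([first], mid, [last])), -90.0, 90.0)
    return np.stack([edges[:-1], edges[1:]], axis=1)
```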
4. Enhanced `calculate_time_bounds()` function

- Added `time_coord` parameter to support different time coordinate names
- Added `bnds_name` parameter to support different bounds dimension names (`"nv"` for ocean, `"bnds"` for atmosphere)
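The call shape implied by these parameters could look like the sketch below. The signature follows the description above, but the interior logic (simple midpoint bounds, at least two time steps) is an assumption, not the PR's exact code:

```python
import numpy as np
import xarray as xr


def calculate_time_bounds(ds, time_coord="time", bnds_name="bnds"):
    """Sketch matching the described signature; ocean files use "nv"
    for the bounds dimension, atmosphere files use "bnds"."""
    t = ds[time_coord].values.astype(float)
    mid = (t[:-1] + t[1:]) / 2.0
    # Extrapolate the first and last edges symmetrically about the samples
    edges = np.concatenate(([2 * t[0] - mid[0]], mid, [2 * t[-1] - mid[-1]]))
    bounds = np.stack([edges[:-1], edges[1:]], axis=1)
    return xr.DataArray(bounds, dims=(time_coord, bnds_name),
                        coords={time_coord: ds[time_coord]})
```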
5. Updated `CMIP6_Atmosphere_CMORiser` class

- Added automatic detection of missing bounds variables during `select_and_process_variables()`
- Automatically calculates missing bounds using the appropriate utility function based on coordinate type
- Issues user warnings when bounds are missing from raw data and being auto-calculated
- Maintains flexibility for different coordinate naming conventions (lat/latitude/y, lon/longitude/x, time/t)
These changes allow `xarray.DataArray` objects to be used as an input format for Moppy.
Overview
This PR adds support for using xarray `Dataset` and `DataArray` objects as direct inputs to `ACCESS_ESM_CMORiser`, enabling in-memory processing workflows while maintaining full backward compatibility with existing file-based workflows.

Problem Statement
The current CMORiser assumes a list of files as input. This creates limitations for workflows where the data is already in memory, for example in xarray-based analysis pipelines.
Solution

New `input_data` Parameter

Added a new `input_data` parameter that accepts xarray `Dataset` or `DataArray` objects.

Usage Examples
With xarray Dataset
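A minimal example of building an in-memory Dataset to pass as `input_data`. The variable and coordinate names are illustrative, and the commented-out constructor call is a hypothetical shape based on the PR description, not verified against the repository:

```python
import numpy as np
import xarray as xr

# Minimal in-memory Dataset; names and values are illustrative
ds = xr.Dataset(
    {"pr": (("time", "lat"), np.zeros((2, 3)))},
    coords={"time": [0.5, 1.5], "lat": [-60.0, 0.0, 60.0]},
)

# Hypothetical call shape based on the PR description:
# cmoriser = ACCESS_ESM_CMORiser(input_data=ds, ...)
```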
With xarray DataArray
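A bare `DataArray` is also accepted; per the PR description it is converted to a single-variable Dataset internally. The sketch below shows that promotion explicitly; the commented-out constructor call is hypothetical:

```python
import numpy as np
import xarray as xr

# A named DataArray; the CMORiser promotes it to a Dataset internally
da = xr.DataArray(np.ones((2, 3)), dims=("time", "lat"), name="pr")
promoted = da.to_dataset(name=da.name)

# Hypothetical call shape based on the PR description:
# cmoriser = ACCESS_ESM_CMORiser(input_data=da, ...)
```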
Backward Compatibility
Existing code continues to work unchanged:
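For reference, the established file-based entry point stays as it was; the paths below are illustrative and the commented-out constructor call is a hypothetical shape, not verified against the repository:

```python
from glob import glob

# File-based workflow, unchanged by the new input_data option
paths = sorted(glob("/path/to/access-esm/output/*.nc"))

# cmoriser = ACCESS_ESM_CMORiser(input_paths=paths, ...)  # as before
```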